Efficiently Finding Near Duplicate Figures in Archives of Historical Documents

Authors

  • Thanawin Rakthanmanon
  • Qiang Zhu
  • Eamonn J. Keogh
Abstract

The increasing interest in archiving all of humankind’s cultural artifacts has resulted in the digitization of millions of books, and soon a significant fraction of the world’s books will be online. Most of the data in historical manuscripts is text, but there is also a significant fraction devoted to images. This fact has driven much of the recent increase in interest in query-by-content systems for images. While querying/indexing systems can undoubtedly be useful, we believe that the historical manuscript domain is finally ripe for true unsupervised discovery of patterns and regularities. To this end, we introduce an efficient and scalable system that can detect approximately repeated occurrences of shape patterns both within and between historical texts. We show that this ability to find repeated shapes allows automatic annotation of manuscripts, and allows users to trace the evolution of ideas. We demonstrate our ideas on datasets of scientific and cultural manuscripts dating back to the fourteenth century.

Keywords: cultural artifacts, duplication detection, repeated patterns

Similar resources

Compact Features for Detection of Near-Duplicates in Distributed Retrieval

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation t...

Organizing News Archives by Near-Duplicate Copy Detection in Digital Libraries

There are huge numbers of documents in digital libraries. How to effectively organize these documents so that humans can easily browse or reference is a challenging task. Existing classification methods and chronological or geographical ordering only provide partial views of the news articles. The relationships among news articles might not be easily grasped. In this paper, we propose a near-du...

A Survey of Duplicate And Near Duplicate Techniques

The World Wide Web consists of more than 50 billion pages online. The advent of the World Wide Web caused a dramatic increase in the usage of the Internet. The World Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost. A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats: HTML, PDF, and Text for diff...

Redundancy Control in Web Archives

Large scale text collections like web archives evolve over time. However, the addition of new documents does not always add novel content, but also introduces contents that are copied, enriched, or recompiled from already existing documents. Thus, such collections are characterized by a lot of redundant content. Redundant documents waste storage space, make content analysis difficult and decrea...

Visual enhancement of old documents with hyperspectral imaging

Hyperspectral imaging (HSI) of historical documents is becoming more common at national libraries and archives. HSI is useful for many tasks related to document conservation and management as it provides detailed quantitative measurements of the spectral reflectance of the document that is not limited to the visible spectrum. In this paper, we focus on how to use the invisible spectra, most not...

Journal:
  • Journal of Multimedia

Volume 7

Publication date: 2012